Channel sizing for inter-kernel communication

ABSTRACT

Systems and methods for dynamically sizing inter-kernel communication channels implemented on an integrated circuit (IC) are provided. Implementation characteristics of the channels, predication, and kernel scheduling imbalances may factor into properly sizing the channels for self-synchronization, resulting in optimized steady-state throughput.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/328,878, entitled “Channel Sizing for Inter-Kernel Communication,” filed May 24, 2021, which is a continuation of U.S. patent application Ser. No. 14/749,379, entitled “Channel Sizing for Inter-Kernel Communication,” filed on Jun. 24, 2015, now U.S. Pat. No. 11,016,742, each of which is incorporated by reference herein in its entirety for all purposes.

BACKGROUND

The present disclosure relates generally to integrated circuits, such as field programmable gate arrays (FPGAs). More particularly, the present disclosure relates to dynamic sizing of channels used for kernel communication on integrated circuits (e.g., FPGAs).

This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.

Integrated circuits (ICs) take a variety of forms. For instance, field programmable gate arrays (FPGAs) are integrated circuits that are intended as relatively general-purpose devices. FPGAs may include logic that may be programmed (e.g., configured) after manufacturing to provide any desired functionality that the FPGA is designed to support. Thus, FPGAs contain programmable logic, or logic blocks, that may be configured to perform a variety of functions on the FPGAs, according to a designer's design. Additionally, FPGAs may include input/output (I/O) logic, as well as high-speed communication circuitry. For instance, the high-speed communication circuitry may support various communication protocols and may include high-speed transceiver channels through which the FPGA may transmit serial data to and/or receive serial data from circuitry that is external to the FPGA.

In ICs such as FPGAs, the programmable logic is typically configured using low level programming languages such as VHDL or Verilog. Unfortunately, these low level programming languages may provide a low level of abstraction and, thus, may provide a development barrier for programmable logic designers. Higher level programming languages, such as OpenCL, have become useful for enabling more ease in programmable logic design. The higher level programs are used to generate code corresponding to the low level programming languages. As used herein, a kernel refers to a digital circuit that implements a specific function and/or program. Kernels may be useful to bridge the low level programming languages into executable instructions that may be performed by the integrated circuits. Each kernel implemented on the IC may execute independently and concurrently from the other kernels on the IC. Accordingly, OpenCL programs typically require at least a single hardware implementation for each kernel in the OpenCL program. Kernels may be individually balanced and data may flow from one kernel to another using one or more dataflow channels (e.g., first-in-first-out (FIFO) channels) between two kernels.

The dataflow channels may be varied in size to accept an appropriate amount of data to flow from one kernel to another. Traditionally, users specify a data capacity for the channels to account for a constrained execution model (e.g., single-threaded execution). Unfortunately, this user-specified capacity does not account for implementation details, because users typically only work with the higher level programs rather than the low level programming languages.

SUMMARY

A summary of certain embodiments disclosed herein is set forth below. It should be understood that these aspects are presented merely to provide the reader with a brief summary of these certain embodiments and that these aspects are not intended to limit the scope of this disclosure. Indeed, this disclosure may encompass a variety of aspects that may not be set forth below.

Present embodiments relate to systems, methods, and devices for enhancing performance of machine-implemented programs through automatic inter-kernel channel sizing based upon one or more factors. In particular, the present embodiments may provide dynamic channel sizing on integrated circuits (ICs, such as FPGAs) based upon the current implementation on the IC, predication, and/or scheduling imbalances. The automatic sizing may aim to increase data throughput between kernel executions.

Various refinements of the features noted above may exist in relation to various aspects of the present disclosure. Further features may also be incorporated in these various aspects as well. These refinements and additional features may exist individually or in any combination. For instance, various features discussed below in relation to one or more of the illustrated embodiments may be incorporated into any of the above-described aspects of the present disclosure alone or in any combination. Again, the brief summary presented above is intended only to familiarize the reader with certain aspects and contexts of embodiments of the present disclosure without limitation to the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings in which:

FIG. 1 is a block diagram of a system that utilizes channel sizing logic to affect a machine-implemented program, in accordance with an embodiment;

FIG. 2 is a block diagram of a programmable logic device that may include logic useful for implementing the channel sizing logic, in accordance with an embodiment;

FIG. 3 is a block diagram illustrating elements of the host and integrated circuit of FIG. 1, in accordance with an embodiment;

FIG. 4 is a block diagram illustrating inter-kernel communication using a plurality of automatically sized channels, in accordance with an embodiment;

FIG. 5 is a block diagram illustrating automatic sizing of channels based upon predication, in accordance with an embodiment;

FIG. 6 is a block diagram illustrating automatic sizing of channels based upon scheduling imbalances, in accordance with an embodiment;

FIG. 7 is a process for solving an integer linear programming problem to determine an automatic channel depth, in accordance with an embodiment; and

FIG. 8 is a block diagram illustrating automatic sizing of channels using integer linear programming (ILP), in accordance with an embodiment.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.

As discussed in further detail below, embodiments of the present disclosure relate generally to circuitry for enhancing performance of machine-readable programs implemented on an integrated circuit (IC). In particular, inter-kernel communication channel sizing may be automatically modified based upon one or more factors. For example, these modifications may be made based upon a current program implementation on the IC, predication, and/or scheduling imbalances.

With the foregoing in mind, FIG. 1 illustrates a block diagram of a system 10 that utilizes channel sizing logic to affect a machine-implemented program. As discussed above, a designer may desire to implement functionality on an integrated circuit 12 (IC, such as a field programmable gate array (FPGA)). The designer may specify a high level program to be implemented, such as an OpenCL program, which may enable the designer to more efficiently and easily provide programming instructions to implement a set of programmable logic for the IC 12 without requiring specific knowledge of low level computer programming languages (e.g., Verilog or VHDL). For example, because OpenCL is quite similar to other high level programming languages, such as C++, designers of programmable logic familiar with such programming languages may have a shorter learning curve than designers that are required to learn unfamiliar low level programming languages to implement new functionalities in the IC.

The designers may implement their high level designs using design software 14, such as a version of Quartus by Altera™. The design software 14 may use a compiler 16 to convert the high level program into a low level program. Further, the compiler 16 (or other component of the system 10) may include channel sizing logic 17 that automatically sizes channels that will be implemented for inter-kernel communications between two or more kernels.

The compiler 16 may provide machine-readable instructions representative of the high level program to a host 18 and the IC 12. For example, the IC 12 may receive one or more kernel programs 20 which describe the hardware implementations that should be stored in the IC. Further, channel sizing definitions 21 may be provided by the channel sizing logic 17, which may automatically define a sizing of channels between the one or more kernel programs 20. As mentioned above, the automatic sizing may be based upon a variety of factors including: program implementation, predication, and/or kernel scheduling imbalances. Sizing of the channels based upon these factors will be discussed in more detail below.

The host 18 may receive a host program 22 which may be implemented by the kernel programs 20. To implement the host program 22, the host 18 may communicate instructions from the host program 22 to the IC 12 via a communications link 24, which may be, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications. Upon receipt of the kernel programs 20 and the channel sizing definition 21, a kernel and/or channel implementation may be executed on the IC 12 and controlled by the host 18. As will be described in more detail below, the host 18 may add, remove, or swap kernel programs 20 from the adapted logic 26, such that execution performance may be enhanced.

Turning now to a more detailed discussion of the IC 12, FIG. 2 illustrates an IC device 12, which may be a programmable logic device, such as a field programmable gate array (FPGA) 40. For the purposes of this example, the device 40 is referred to as an FPGA, though it should be understood that the device may be any type of programmable logic device (e.g., an application-specific integrated circuit and/or application-specific standard product). As shown, FPGA 40 may have input/output circuitry 42 for driving signals off of device 40 and for receiving signals from other devices via input/output pins 44. Interconnection resources 46, such as global and local vertical and horizontal conductive lines and buses, may be used to route signals on device 40. Additionally, interconnection resources 46 may include fixed interconnects (conductive lines) and programmable interconnects (i.e., programmable connections between respective fixed interconnects). Programmable logic 48 may include combinational and sequential logic circuitry. For example, programmable logic 48 may include look-up tables, registers, and multiplexers. In various embodiments, the programmable logic 48 may be configured to perform a custom logic function. The programmable interconnects associated with interconnection resources may be considered to be a part of programmable logic 48. As discussed in further detail below, the FPGA 40 may include adaptable logic that enables partial reconfiguration of the FPGA 40, such that kernels may be added, removed, and/or swapped during the runtime of the FPGA 40.

Programmable logic devices, such as FPGA 40, may contain programmable elements 50 with the programmable logic 48. For example, as discussed above, a designer (e.g., a customer) may program (e.g., configure) the programmable logic 48 to perform one or more desired functions. By way of example, some programmable logic devices may be programmed by configuring their programmable elements 50 using mask programming arrangements, which is performed during semiconductor manufacturing. Other programmable logic devices are configured after semiconductor fabrication operations have been completed, such as by using electrical programming or laser programming to program their programmable elements 50. In general, programmable elements 50 may be based on any suitable programmable technology, such as fuses, antifuses, electrically-programmable read-only-memory technology, random-access memory cells, mask-programmed elements, and so forth.

Most programmable logic devices are electrically programmed. With electrical programming arrangements, the programmable elements 50 may be formed from one or more memory cells. For example, during programming, configuration data is loaded into the memory cells 50 using pins 44 and input/output circuitry 42. In one embodiment, the memory cells 50 may be implemented as random-access-memory (RAM) cells. The use of memory cells 50 based on RAM technology described herein is intended to be only one example. Further, because these RAM cells are loaded with configuration data during programming, they are sometimes referred to as configuration RAM cells (CRAM). These memory cells 50 may each provide a corresponding static control output signal that controls the state of an associated logic component in programmable logic 48. For instance, in some embodiments, the output signals may be applied to the gates of metal-oxide-semiconductor (MOS) transistors within the programmable logic 48.

The circuitry of FPGA 40 may be organized using any suitable architecture. As an example, the logic of FPGA 40 may be organized in a series of rows and columns of larger programmable logic regions, each of which may contain multiple smaller logic regions. The logic resources of FPGA 40 may be interconnected by interconnection resources 46 such as associated vertical and horizontal conductors. For example, in some embodiments, these conductors may include global conductive lines that span substantially all of FPGA 40, fractional lines such as half-lines or quarter lines that span part of device 40, staggered lines of a particular length (e.g., sufficient to interconnect several logic areas), smaller local lines, or any other suitable interconnection resource arrangement. Moreover, in further embodiments, the logic of FPGA 40 may be arranged in more levels or layers in which multiple large regions are interconnected to form still larger portions of logic. Still further, device arrangements may use logic that is arranged in a manner other than rows and columns.

As discussed above, the FPGA 40 may allow a designer to create a customized design capable of executing and performing customized functionalities. Each design may have its own hardware implementation to be implemented on the FPGA 40. For instance, a single hardware implementation is needed for each kernel in a design for the FPGA 40. Further, one or more channels may be implemented for inter-kernel communication. In some embodiments, these channels may include one or more first-in-first-out (FIFO) buffers useful for data flow between two or more kernels. The inter-kernel communication channels may be automatically sized based upon a variety of factors, as described in more detail below.

Referring now to FIG. 3, a block diagram illustrating the system 10, further detailing elements of the host 18 and IC 12 of FIG. 1, is provided. As illustrated, the IC 12 may include fixed components 60 and configurable components 62. Some ICs, such as a Stratix® V FPGA by Altera®, provide partial reconfiguration capabilities. For example, in some embodiments, the configurable components may include a number (N) of partial reconfiguration (PR) blocks 64 stored on an IC 12 (such as FPGA 40 of FIG. 2). The PR blocks 64 may provide an ability to reconfigure part of the IC 12 while the rest of the device continues to work. The PR blocks 64 may include ports to both on-chip memory interconnects and off-chip interconnects (ports 66 and 68, respectively). The PR blocks 64 are not restricted to a particular protocol; however, each of the PR blocks 64 within an IC 12 may agree on a common protocol. For example, each of the PR blocks 64 may use the Avalon® Memory-Mapped (Avalon-MM) interface, which may allow easy interconnection between components in the IC 12.

The size and number of PR blocks 64 may be defined by the hardware implementations and amount of programmable logic available on the IC 12. For example, the hardware implementations 26 for each kernel 20 and/or inter-kernel communication channel 21 may be placed in one or more PR blocks 64. In certain embodiments, the hardware implementations 26 may be placed in programmable logic that is not a partial reconfiguration block 64. For example, the kernels 20 and/or the channel definitions (e.g., channel sizing 21) may be provided by the compiler 16 (e.g., utilizing the channel sizing logic 17 of FIG. 1).

Turning now to a discussion of the fixed logic 60, the fixed logic 60 may include an on-chip memory interconnect 70, an arbitration network 72, local memory 74, an off-chip interconnect 76, external memory and physical layer controllers 78, and/or a PCIe bus 80. The on-chip memory interconnect 70 may connect to the PR blocks 64 over the on-chip memory interconnect ports 66 of the PR blocks 64. The on-chip memory interconnect 70 may facilitate access between the PR blocks 64 and the local memory 74 via the arbitration network 72. Further, the off-chip interconnect 76 may connect to the PR blocks 64 over the off-chip interconnect ports 68 of the PR blocks 64. The off-chip interconnect 76 may facilitate communications between the PR blocks 64 and the host communications components (e.g., the external memory and physical layer controllers 78 and the PCIe bus 80). The external memory and physical layer controllers 78 may facilitate access between the IC 12 and external memory (e.g., host 18 memory 82). Further, the PCIe bus 80 may facilitate communication between the IC 12 and an external processor (e.g., host 18 processor 84). As will become more apparent, based on the discussion that follows, communications between the host 18 and the IC 12 may be very useful in enabling adaptable logic on the IC 12.

FIG. 4 is a block diagram illustrating an example of a kernel and automatically sized channel implementation 26, in accordance with an embodiment. The implementation 26 example of FIG. 4 includes three kernels 20A, 20B, and 20C. Output of kernel 20A is forwarded to kernel 20B via two inter-kernel communication channels 100A and 100B. Further, output of kernel 20B is provided to kernel 20C via inter-kernel communication channel 100C. Output from kernel 20C is provided, as input, to kernel via inter-kernel communication channel 100D.

Each of the channels 100A, 100B, 100C, and 100D may be automatically sized based upon one or more factors. A variety of channel 100A-100D implementations may be used. For example, channels 100A-100D may be implemented on the IC using registers, using low-latency components, using high-latency components, using block random access memory (RAM) (e.g., dedicated RAM), etc. The latency of the channels 100A-100D may vary, depending on the architecture of the implementation of these channels 100A-100D. The latency of the channels 100A-100D may impact throughput, and thus, is one implementation factor that may be used for automatic sizing of the channels 100A-100D. Latency is defined herein as the number of cycles it takes for the data of a write of a channel (e.g., channel 100A-100D) to be read at the other end of the channel 100A-100D. In other words, the latency is the number of cycles it takes for a “not-full” state to propagate to the write site of a channel (e.g., channel 100A-100D). To ensure proper sizing of the channels 100A-100D, the depth of the channels 100A-100D may be sized such that their depth is greater than the latency of the channels 100A-100D. For example, the compiler (e.g., compiler 16 of FIG. 1) may determine a latency of the channel 100A-100D implementations and/or may retrieve a known latency of the channels 100A-100D based upon an ascribed latency for channel implementations. The compiler may ensure that the sizing of each of the channels 100A-100D is greater than its corresponding latency. This may help to ensure that data is not requested prior to a time when it propagates to the other end of the channels 100A-100D. In some embodiments, the compiler may first determine a desirable channel depth and select a channel implementation based on the determined desirable channel depth (e.g., by selecting a channel implementation that has a lower latency than the desirable channel depth). Regardless of whether the implementation latency determines the depth of the channel or the desired channel depth determines the channel implementation, the compiler may maintain a relationship where the depth of the channels is greater than the latency of the implementation.
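
By way of non-limiting illustration, the relationship between channel depth and implementation latency described above may be sketched in Python as follows. The names `ChannelImpl`, `min_depth_for`, `pick_impl`, and `IMPLS`, as well as the latency figures, are hypothetical and are not taken from any particular compiler; the preference for the highest qualifying latency is likewise an arbitrary assumption made only to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ChannelImpl:
    """Hypothetical description of one way to build a channel."""
    name: str
    latency: int  # cycles for a write to become readable at the far end


def min_depth_for(impl: ChannelImpl) -> int:
    """Smallest depth that is strictly greater than the implementation latency."""
    return impl.latency + 1


def pick_impl(desired_depth: int,
              candidates: Sequence[ChannelImpl]) -> Optional[ChannelImpl]:
    """Return a candidate whose latency is below the desired depth, if any.

    Among qualifying candidates this sketch arbitrarily prefers the highest
    latency; a real compiler would likely weigh area and timing instead.
    """
    fitting = [c for c in candidates if c.latency < desired_depth]
    return max(fitting, key=lambda c: c.latency, default=None)


# Illustrative latencies only (assumed values, not measured ones).
IMPLS = [
    ChannelImpl("registers", latency=1),
    ChannelImpl("low_latency_fifo", latency=2),
    ChannelImpl("block_ram_fifo", latency=4),
]
```

Either direction preserves the same invariant: `min_depth_for` derives a depth from a chosen implementation, while `pick_impl` derives an implementation from a chosen depth, and in both cases the depth remains greater than the latency.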

Additional factors for automatic sizing may include predication and/or scheduling imbalances. As noted above, channel implementation factors, such as the channel 100A-100D latency, may impact throughput of the channels 100A-100D. Predication (i.e., the channel reads and/or writes are not executed every execution cycle) may also affect throughput in inter-kernel communication. For example, stalls may occur when an attempt is made to write into a full channel 100A-100D. The length of the stall is the time it takes for the channel 100A-100D to become “not-full” at the location where the write is to occur (e.g., the latency of the channel 100A-100D). To counteract stalls, extra depth may be automatically added to the channels 100A-100D to account for the latency of the channels 100A-100D, as will be discussed in more detail with regard to FIG. 5.

Turning now to a predication example 120, FIG. 5 is a block diagram illustrating automatic sizing of channels 122A and/or 122B based upon predication, in accordance with an embodiment. In the example 120, the kernels 124A and 124B are communicatively connected via FIFO-buffer-based channels 122A and 122B. The kernels 124A and 124B use the same selection logic 126 for the multiplexer (MUX) 128 of kernel 124A and the de-multiplexer (DEMUX) 130 of kernel 124B.

The MUX 128 and DEMUX 130 illustrate shared predication logic on the two channels 122A and 122B. For example, the selected outputs of the MUX 128 are provided to the DEMUX 130 via either channel 122A or channel 122B. Because the selection logic 126 may result in reads and/or writes of the channels 122A and/or 122B not executing every cycle (e.g., they are predicated), stalls may occur (e.g., when attempting to write into a full channel 122A and/or 122B). For instance, in the current example, each of the channels 122A and 122B has a capacity of 5 elements, as illustrated by the element containers 126. If the selection logic results in the first five elements being written to channel 122A, the sixth element being written to channel 122B, and the seventh element back to channel 122A, a stall will occur at the seventh write. The stall occurs because the seventh element cannot be written to channel 122A, which is full with elements 1-5, because kernel 124B received data from channel 122B when the sixth element was written, due to the selection logic 126 for the MUX 128 and DEMUX 130 being the same.

In other words, a control signal from the channel 122A indicating that it is “not full” will not reach the kernel 124A prior to the attempt to write the seventh data element, due to latency of the channel 122A. Accordingly, when the kernel 124A attempts to write the seventh data element, it will see the channel 122A as full, resulting in a stall.

To counteract the stalls, the channels 122A and/or 122B may be automatically sized (e.g., via the channel sizing logic 17 of FIG. 1) to include enough space for the implemented channel capacity (e.g., here, five elements) plus the latency of the channel 122A and/or 122B. As mentioned above, the latency of the channels 122A and/or 122B is defined as the number of cycles it takes for the data of a write to a channel 122A and/or 122B to be read at the other end of the channel 122A and/or 122B. In other words, the latency is the time it takes for the “not-full” state to propagate to the write site of a channel. Accordingly, in the current example, additional elements may be added to the channels 122A, because the selection logic 126, on the seventh write attempt, selects data from the channel 122A for reading at the DEMUX 130, resulting in an empty element container 126 in one cycle. By adding the latency to the implemented capacity, potential stalls due to predication may be avoided.
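
By way of non-limiting illustration, the effect of adding the latency to the implemented capacity may be sketched with a toy cycle model in Python. This is a deliberately simplified, single-channel model and not the two-channel MUX/DEMUX arrangement of FIG. 5: it assumes a writer that attempts one write per cycle, a reader that is predicated off for `reader_pause` cycles and then reads once per cycle, and a “not-full” indication that reaches the writer `latency` cycles after a read. The function names and parameters are hypothetical.

```python
def count_write_stalls(depth: int, latency: int, reader_pause: int,
                       cycles: int = 50) -> int:
    """Count cycles in which the writer sees the channel as full."""
    writes = 0        # elements written so far
    reads = 0         # elements read so far
    reads_after = []  # reads_after[t] = total reads once cycle t has read
    stalls = 0
    for t in range(cycles):
        # Reader side: predicated off during the pause, then one read per cycle.
        if t >= reader_pause and writes - reads > 0:
            reads += 1
        reads_after.append(reads)
        # Writer side: freed slots only become visible `latency` cycles later.
        visible_reads = reads_after[t - latency] if t >= latency else 0
        if writes - visible_reads < depth:
            writes += 1
        else:
            stalls += 1
    return stalls


def sized_depth(capacity: int, latency: int) -> int:
    """Depth described in the text: implemented capacity plus channel latency."""
    return capacity + latency
```

Under these assumptions, a channel of depth 5 with a latency of 2 that fills while the reader is paused for 5 cycles stalls the writer for 2 cycles, whereas the same scenario with a depth of `sized_depth(5, 2)` does not stall in this model.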

Automatic channel sizing may also account for scheduling imbalances in the kernels, such that throughput efficiencies may be realized. As mentioned above, each of the kernels may be independently balanced. Indeed, small portions of the kernels may be individually scheduled to create an efficient runtime. Because each of the kernels includes its own schedulings and/or latencies and because scheduling all of the inter-communicating kernels together as a single entity would result in significant runtime increases, the channels may be sized to accommodate kernel-based scheduling imbalances.

FIG. 6 is a block diagram illustrating an example 140 of automatic sizing of channels based upon scheduling imbalances, in accordance with an embodiment. FIG. 7 is a process for solving an integer linear programming problem to define a depth of the channels. FIGS. 6 and 7 will be discussed together for clarity.

In the example 140 of FIG. 6, a first kernel 142A provides data to a second kernel 142B via two channels (e.g., FIFO-buffer-based channels 144A and 144B). There is latency 146 between kernel 142A writes to the buffers 144A and 144B. For example, as data flows through kernel 142A (as illustrated by the arrow 148), data is first written to the channel 144A (as illustrated by the arrow 150). Data continues to flow for the latency 146 period (as illustrated by arrow 148) and a second piece of data is written to the channel 144B (as illustrated by arrow 152).

As may be appreciated, there is also a corresponding latency 154 between kernel 142B's reading of data from the channels 144A and 144B. For example, as data flows in kernel 142B (as illustrated by arrow 156), data is first read from the channel 144A (as indicated by arrow 150). Data flow continues for the latency 154 period (as illustrated by arrow 156). After the latency 154 period, a second data read occurs from the channel 144B (as illustrated by arrow 152).

To size the channels, the compiler may first calculate the maximum latency and the minimum capacity for each endpoint of the channel (e.g., each read and write site) (block 172 of FIG. 7). This latency is the amount of time it would take a thread to reach this endpoint from the start of the kernel. The minimum capacity is the minimum number of threads that could be live along that path prior to the endpoint. In other words, the latency is the amount of time it takes a thread to reach a certain point and the capacity is the number of threads that can be in a pipeline.

Next, a variable is used to represent any scheduling slack for the kernel (block 174 of FIG. 7). The slack may represent a delayed start of the kernel relative to other kernels. As the kernels start up, there may be some initial stalls, because kernels may be waiting for initial data to be processed by a predecessor kernel. However, in steady-state operation, stalls may be minimized and/or removed.

Next, a constraint is added for each channel (block 176 of FIG. 7). The constraint states that the slack for the kernel on the read side of the channel minus the slack of the kernel on the write side should be greater than or equal to the maximum latency it takes to get to the read minus the minimum capacity on the write side. In other words, this constraint calculates the number of threads that need to be held in the channel, in the worst case, when one kernel is able to consume more threads than another.

A cost function is then calculated for each pipeline, using the width of the channel (block 178 of FIG. 7). For example, if one channel sends 32 bits of data and another sends 512 bits of data, it would be much more expensive to create depth on the 512 bits of data.
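
By way of non-limiting illustration, the width dependence of the cost may be sketched in Python as follows. The sketch assumes that the storage cost of a channel is simply its width multiplied by its depth; the description states only that wider channels are more expensive to deepen, so this particular formula, and the names `storage_bits` and `total_storage_bits`, are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def storage_bits(width_bits: int, depth: int) -> int:
    """Assumed cost model: bits of storage = channel width x channel depth."""
    if width_bits <= 0 or depth < 0:
        raise ValueError("width must be positive and depth non-negative")
    return width_bits * depth


def total_storage_bits(channels: Iterable[Tuple[int, int]]) -> int:
    """Sum the assumed cost over (width_bits, depth) pairs."""
    return sum(storage_bits(width, depth) for width, depth in channels)


# Adding four elements of depth to each of the two channels in the example:
narrow_cost = storage_bits(32, 4)   # 32-bit channel
wide_cost = storage_bits(512, 4)    # 512-bit channel
```

Under this assumed model, the same four elements of added depth cost sixteen times as many bits on the 512-bit channel as on the 32-bit channel, which is why the cost function weights each channel by its width.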

The depths of the channels may then be set (block 180 of FIG. 7). The depth may be the relative difference between the read and write endpoints of the kernels plus the difference between the maximum latency and the minimum capacity.

FIG. 8 is a block diagram 200 illustrating automatic sizing of channels using integer linear programming (ILP), in accordance with an embodiment. The cost function may be minimized, resulting in an implementation that uses a minimal area of the IC programmable logic.

In the block diagram 200, there are two kernels k₁ 202A and k₂ 202B. Channel 204A is named FIFO A and has a width of 32 bits. Channel 204B is named FIFO B and has a width of 16 bits. Point A_(w) is where writes to FIFO A 204A occur. Point A_(r) is where reads from FIFO A 204A occur. Point B_(w) is where writes to FIFO B 204B occur and point B_(r) is where reads from FIFO B 204B occur. The format m(n) may represent the maximum latency and the minimum capacity to the specific point in the kernel 202A and/or 202B. For example, 5(10) may represent a maximum latency of 5 and a minimum capacity of 10 at a particular point. Thus, at A_(w), both the maximum latency and the minimum capacity are 1. At B_(w), the maximum latency is 1 and the minimum capacity is 10. At A_(r), the maximum latency is 5 and the minimum capacity is 1. At B_(r), the maximum latency and the minimum capacity are both 5. These values may be determined, for example, by the compiler during runtime.

To solve the ILP problem, the cost of kernel k₁ 202A is determined as −32+(−16)=−48 and the cost of kernel k₂ 202B is determined as 32+16=48. The cost function (−48k₁+48k₂) is then minimized. FIFO A channel 204A constraint (k₂−k₁>=1−5=−4) is added. Additionally, FIFO B channel 204B constraint (k₂−k₁>=10−5=5) is added. Then, to make the problem solvable, a dummy node (e.g., “source”) is created and additional constraints k₁−source>=1000000 and source−k₂>=1000000 are added. While the current example uses 1000000, any large cost factor may be used. The cost factor may be large, such that it has a negligible effect on the solution of this equation. Then, the ILP problem is solved to get k₂−k₁=5. This difference is used in the depth calculation of FIFO A channel 204A and FIFO B channel 204B. The depth of FIFO A channel 204A may be set to 5+5−1=9 and the depth of FIFO B channel 204B may be set to 5+5−10=0.
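
By way of non-limiting illustration, the arithmetic of the FIG. 8 example may be reproduced with a short Python sketch. The sketch relies on several assumptions that are not stated in the description: with only two kernels, the problem is treated as having a single unknown, the slack difference k₂−k₁, so no ILP solver is invoked and the dummy “source” node is omitted; the lower bound for each channel is taken as the minimum capacity at the write endpoint minus the maximum latency at the read endpoint, which is the reading that yields the bounds of −4 and 5 used in the example; and the names `Channel`, `slack_lower_bound`, and `solve_two_kernel` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Channel:
    name: str
    width_bits: int
    write_min_capacity: int  # n of m(n) at the write endpoint
    read_max_latency: int    # m of m(n) at the read endpoint


def slack_lower_bound(ch: Channel) -> int:
    """Assumed per-channel bound on k2 - k1 (keeps the depth non-negative)."""
    return ch.write_min_capacity - ch.read_max_latency


def solve_two_kernel(channels: Sequence[Channel]) -> Tuple[int, Dict[str, int]]:
    """Minimize sum(width) * (k2 - k1) subject to every channel bound.

    Widths are positive, so the minimum is the smallest feasible
    difference, i.e. the largest of the per-channel lower bounds.
    """
    diff = max(slack_lower_bound(c) for c in channels)
    depths = {
        c.name: diff + c.read_max_latency - c.write_min_capacity
        for c in channels
    }
    return diff, depths


FIFO_A = Channel("FIFO A", 32, write_min_capacity=1, read_max_latency=5)
FIFO_B = Channel("FIFO B", 16, write_min_capacity=10, read_max_latency=5)

diff, depths = solve_two_kernel([FIFO_A, FIFO_B])
```

With the values of FIG. 8, the sketch yields a slack difference of 5 and depths of 9 for FIFO A and 0 for FIFO B, matching the figures given above. Designs with more than two kernels would generally call for a true ILP formulation rather than this closed form.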

By implementing the automatic channel sizing logic, inter-kernel channel communication throughput may be enhanced. For example, if the sizing of the channels does not account for implementation factors, predication, and/or scheduling imbalances, a write attempt may occur to a full inter-kernel communication channel. This may result in an unnecessary data stall, reducing throughput. Accordingly, by allowing the compiler (or other component) to automatically size these channels based upon the various implementation factors, predication, and/or scheduling imbalances, throughput efficiencies may be obtained.

While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and have been described in detail herein. However, it should be understood that the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims.

What is claimed is:
 1. A tangible, non-transitory, machine-readable medium, comprising machine readable instructions to: access a high level program comprising instructions to implement a design on an integrated circuit; convert the high level program into a low level program to be implemented on the integrated circuit; identify a first kernel, a second kernel, and an inter-kernel channel based on the low level program or based on the high level program, wherein the first kernel writes data to the inter-kernel channel and the second kernel reads the data from the inter-kernel channel; identify a potential stall due to expected writes to the inter-kernel channel by the first kernel, an indication of expected reads from the inter-kernel channel by the second kernel, or both, wherein the expected writes of the inter-kernel channel comprises, during a first cycle, the first kernel writing to the inter-kernel channel, and, during a second cycle, the first kernel not writing to the inter-kernel channel, and wherein the expected reads of the inter-kernel channel comprises, during a third cycle, the second kernel reading from the inter-kernel channel, and, during a fourth cycle, the second kernel not reading from the inter-kernel channel; modify a size of the inter-kernel channel, based on a determination of a number of elements to be held in the inter-kernel channel to prevent the potential stall when the first kernel writes more elements to the inter-kernel channel than the second kernel reads or the second kernel reads more elements from the inter-kernel channel than the first kernel writes, wherein the size is based on the expected writes of the inter-kernel channel, the expected reads of the inter-kernel channel, or both; and provide one or more kernels and the inter-kernel channel corresponding to the low level program for implementation on the integrated circuit, wherein the one or more kernels include the first and second kernels.
 2. The tangible, non-transitory, machine-readable medium of claim 1, wherein the inter-kernel channel comprises a first-in, first-out buffer.
 3. The tangible, non-transitory, machine-readable medium of claim 1, wherein the instructions are further to: identify, via the compiler, a second inter-kernel channel, wherein the first kernel writes to the second inter-kernel channel, and the second kernel reads from the second inter-kernel channel.
 4. The tangible, non-transitory, machine-readable medium of claim 1, wherein the size of the inter-kernel channel comprises a depth of the inter-kernel channel.
 5. The tangible, non-transitory, machine-readable medium of claim 1, wherein the inter-kernel channel comprises at least one control signal, and wherein the at least one control signal is provided to the first kernel, the second kernel, or both.
 6. The tangible, non-transitory, machine-readable medium of claim 5, wherein the at least one control signal comprises a not full signal that indicates whether a buffer of the inter-kernel channel is full.
 7. The tangible, non-transitory, machine-readable medium of claim 1, wherein the instructions are further to identify the potential stall, wherein the potential stall would occur because a buffer of the inter-kernel channel is full.
 8. The tangible, non-transitory, machine-readable medium of claim 1, wherein the instructions are further to identify the potential stall, wherein the potential stall would occur because a buffer of the inter-kernel channel is empty.
 9. The tangible, non-transitory, machine-readable medium of claim 1, wherein the first cycle and the fourth cycle occur simultaneously, and the second cycle and the third cycle occur simultaneously.
 10. The tangible, non-transitory, machine-readable medium of claim 1, wherein the low level program provided for storage on the integrated circuit has the modified size of the inter-kernel channel.
 11. The tangible, non-transitory, machine-readable medium of claim 1, wherein the expected writes comprises an indication of a predication of the expected writes.
 12. The tangible, non-transitory, machine-readable medium of claim 1, wherein the expected writes comprises an indication of performance of conditional branches containing the expected writes.
 13. The tangible, non-transitory, machine-readable medium of claim 1, wherein the expected reads comprises an indication of a predication of the expected writes.
 14. The tangible, non-transitory, machine-readable medium of claim 1, wherein the expected writes comprises an indication of performance of conditional branches containing the expected reads.
 15. The tangible, non-transitory, machine-readable medium of claim 1, wherein the potential stall would be caused due to predication of the expected writes or the expected reads.
 16. The tangible, non-transitory, machine-readable medium of claim 1, wherein the modification of the size of the inter-kernel channel is based at least in part on predication of the expected writes, the expected reads, or both.
 17. A system, comprising: an integrated circuit; memory storing instructions; and a processor to execute the instructions to: implement a compiler; access, via the compiler, a high level program comprising instructions to implement a design on the integrated circuit; convert, via the compiler, the high level program into a low level program to be implemented on the integrated circuit; identify, via the compiler, a first kernel, a second kernel, and an inter-kernel channel based on the low level program or based on the high level program, wherein the first kernel writes data to the inter-kernel channel and the second kernel reads the data from the inter-kernel channel; identify, via the compiler, a potential stall due to expected writes to the inter-kernel channel by the first kernel, expected reads from the inter-kernel channel by the second kernel, or both, wherein the expected writes of the inter-kernel channel comprises, during a first cycle, the first kernel writing to the inter-kernel channel, and, during a second cycle, the first kernel not writing to the inter-kernel channel, and wherein the expected reads of the inter-kernel channel comprises, during a third cycle, the second kernel reading from the inter-kernel channel, and, during a fourth cycle, the second kernel not reading from the inter-kernel channel; modify, via the compiler, a size of the inter-kernel channel, based on a determination of a number of elements to be held in the inter-kernel channel to prevent the expected stall when the first kernel writes more elements to the inter-kernel channel than the second kernel reads or the second kernel reads more elements from the inter-kernel channel than the first kernel writes, wherein the size is based on the expected writes of the inter-kernel channel, the expected reads of the inter-kernel channel, or both; and provide, via the compiler, one or more kernels and the inter-kernel channel corresponding to the low level program for implementation on the integrated circuit, wherein the one or more kernels comprise the first kernel and the second kernel.
 18. The system of claim 17, wherein the integrated circuit comprises a field-programmable gate array.
 19. The system of claim 17, wherein the compiler is part of a design software package executed by the processor.
 20. The system of claim 17, wherein the instructions are further to identify the potential stall, wherein the potential stall would occur because a buffer of the inter-kernel channel is empty or full.